{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# NER response\n",
    "\n",
    "### LSTM vs BiLSTM Result\n",
    "I used precision, recall and F1 score as the metric. BiLSTM performs better than LSTM in the NER task since for each word considering the previous and the next words are important to understand its context. However, both models have overfitting problem. This overfitting here can be caused by lack of features and need more feature engineering(discussed in further work).\n",
    "\n",
    "### LSTM vs RNN\n",
    "The advantage of using an LSTM over a vanilla RNN is that LSTM can maintain long-term information by training the memory cell. While both methods train on a sequential dataset, a vanilla RNN easily losses long-term information due to vanishing gradients. The sigmoid function in forget and input/output gates ensures that the small gradients in between two highly correlated words would not corrupt the relationship. With the gates, LSTM achieves higher prediction accuracy than a vanilla RNN in text analysis.\n",
    "\n",
    "### BiLSTM vs LSTM \n",
    "The advantage of using a BiLSTM is that it learns additional future information with the whole context. BiLSTM combines the parameters trained from a positive and a reversed sequential direction to make predictions. Therefore, it preserves information from both the past and the future while a vanilla LSTM only predicts based on the past. This is especially important for the NER project.\n",
    "BiLSTM is applicable on this assignment because this task does not require online training. We already have the whole context when classifying name entities.\n",
    "\n",
    "### Further Work\n",
    "The performance of embedding + LSTM/BiLSTM is not very good on the validation dataset. From the research work by Jason Chiu and Eric Nichols, adding the casing and character information and applying CNN after LSTM will significantly improve the performance. I have tested the two embedding methods, one using the trained embedding from GloVE and the other using embedding trained on this dataset itself. The performance was pretty close and does not address the overfitting problem. The further work to improve the NER classification would be considering capitalization feature, detecting words with LSTM and characters using character-level CNNs.\n",
    "\n",
    "Reference: Jason P.C. Chiu, Eric Nichols. \"Named Entity Recognition with Bidirectional LSTM-CNNs\".arXiv:1511.08308"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import tqdm\n",
    "from keras.layers import TimeDistributed,Conv1D,Dense,Embedding,Input,Dropout,LSTM,Bidirectional,MaxPooling1D,Flatten,concatenate\n",
    "from keras.models import Model\n",
    "from keras.utils import Progbar\n",
    "from util import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_processed, train_processed_len, wordEmbeddings, label2Idx = readfile('data/ner_train.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "val_processed, _, _, _ = readfile('data/ner_validation.txt')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# BiLSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "words_input (InputLayer)     (None, None)              0         \n",
      "_________________________________________________________________\n",
      "embedding_4 (Embedding)      (None, None, 100)         1841700   \n",
      "_________________________________________________________________\n",
      "bidirectional_3 (Bidirection (None, None, 600)         962400    \n",
      "_________________________________________________________________\n",
      "time_distributed_4 (TimeDist (None, None, 10)          6010      \n",
      "=================================================================\n",
      "Total params: 2,810,110\n",
      "Trainable params: 968,410\n",
      "Non-trainable params: 1,841,700\n",
      "_________________________________________________________________\n",
      "14039/14041 [============================>.] - ETA: 0strain data precision: 0.897, Rec: 0.880, F1: 0.888\n",
      "3246/3250 [============================>.] - ETA: 0sValidation data precision: 0.130, Rec: 0.088, F1: 0.105\n"
     ]
    }
   ],
   "source": [
    "# building the BiLSTM model\n",
    "words_input = Input(shape=(None,),dtype='int32',name='words_input')\n",
    "words = Embedding(input_dim=wordEmbeddings.shape[0], output_dim=wordEmbeddings.shape[1],  weights=[wordEmbeddings], trainable=False)(words_input)\n",
    "output = Bidirectional(LSTM(300, return_sequences=True, dropout=0.5, recurrent_dropout=0.5))(words)\n",
    "output = TimeDistributed(Dense(len(label2Idx), activation='softmax'))(output)\n",
    "model = Model(inputs=words_input, outputs= output)\n",
    "model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam')\n",
    "model.summary()\n",
    "\n",
    "# training\n",
    "epochs = 20\n",
    "for epoch in range(epochs):    \n",
    "    #print(\"Epoch %d/%d\"%(epoch,epochs))\n",
    "    a = Progbar(len(train_processed_len))\n",
    "    for i,j in enumerate(iterate_minibatches(train_processed,train_processed_len)):\n",
    "        labels, tokens = j      \n",
    "        model.train_on_batch(tokens, labels)\n",
    "        a.update(i)\n",
    "\n",
    "idx2Label = {v: k for k, v in label2Idx.items()}\n",
    "\n",
    "#  Performance on training dataset        \n",
    "predLabels, correctLabels = tag_dataset(train_processed, model)        \n",
    "prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)\n",
    "print(\"train data precision: %.3f, Rec: %.3f, F1: %.3f\" % (prec_val, rec_val, f1_val))\n",
    "\n",
    "#  Performance on validation dataset        \n",
    "predLabels, correctLabels = tag_dataset(val_processed, model)        \n",
    "prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)\n",
    "print(\"Validation data precision: %.3f, Rec: %.3f, F1: %.3f\" % (prec_val, rec_val, f1_val))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "_________________________________________________________________\n",
      "Layer (type)                 Output Shape              Param #   \n",
      "=================================================================\n",
      "words_input (InputLayer)     (None, None)              0         \n",
      "_________________________________________________________________\n",
      "embedding_5 (Embedding)      (None, None, 100)         1841700   \n",
      "_________________________________________________________________\n",
      "lstm_5 (LSTM)                (None, None, 300)         481200    \n",
      "_________________________________________________________________\n",
      "time_distributed_5 (TimeDist (None, None, 10)          3010      \n",
      "=================================================================\n",
      "Total params: 2,325,910\n",
      "Trainable params: 484,210\n",
      "Non-trainable params: 1,841,700\n",
      "_________________________________________________________________\n",
      "14036/14041 [============================>.] - ETA: 0strain data precision: 0.825, Rec: 0.808, F1: 0.817\n",
      "3248/3250 [============================>.] - ETA: 0sValidation data precision: 0.106, Rec: 0.077, F1: 0.089\n"
     ]
    }
   ],
   "source": [
    "# building the LSTM model\n",
    "words_input = Input(shape=(None,),dtype='int32',name='words_input')\n",
    "words = Embedding(input_dim=wordEmbeddings.shape[0], output_dim=wordEmbeddings.shape[1],  weights=[wordEmbeddings], trainable=False)(words_input)\n",
    "output = LSTM(300, return_sequences=True, dropout=0.5, recurrent_dropout=0.5)(words)\n",
    "output = TimeDistributed(Dense(len(label2Idx), activation='softmax'))(output)\n",
    "model = Model(inputs=words_input, outputs= output)\n",
    "model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam')\n",
    "model.summary()\n",
    "\n",
    "epochs = 20\n",
    "for epoch in range(epochs):    \n",
    "    #print(\"Epoch %d/%d\"%(epoch,epochs))\n",
    "    a = Progbar(len(train_processed_len))\n",
    "    for i,j in enumerate(iterate_minibatches(train_processed,train_processed_len)):\n",
    "        labels, tokens = j      \n",
    "        model.train_on_batch(tokens, labels)\n",
    "        a.update(i)\n",
    "        \n",
    "idx2Label = {v: k for k, v in label2Idx.items()}\n",
    "\n",
    "#  Performance on training dataset        \n",
    "predLabels, correctLabels = tag_dataset(train_processed, model)        \n",
    "prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)\n",
    "print(\"train data precision: %.3f, Rec: %.3f, F1: %.3f\" % (prec_val, rec_val, f1_val))\n",
    "\n",
    "#  Performance on validation dataset        \n",
    "predLabels, correctLabels = tag_dataset(val_processed, model)        \n",
    "prec_val, rec_val, f1_val = compute_f1(predLabels, correctLabels, idx2Label)\n",
    "print(\"Validation data precision: %.3f, Rec: %.3f, F1: %.3f\" % (prec_val, rec_val, f1_val))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
